[PyTorch][Core][JAX] Expand troubleshooting docs #2602
jberchtold-nvidia wants to merge 5 commits into NVIDIA:main
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Greptile Overview
Greptile Summary
Confidence Score: 4/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant User
participant UV as uv/venv
participant Pip as pip/uv pip
participant Build as TE build (PEP517)
participant TE as transformer_engine
participant JAX
User->>UV: Activate virtual environment
User->>Pip: Install TE (uv pip install --no-build-isolation ...)
Pip->>Build: Build TE without isolation
Build-->>Pip: Wheel / install artifacts
Pip-->>TE: Importable package in venv
User->>TE: Run workload
alt cuDNN sublibrary loading failed
TE->>TE: dlopen cuDNN libs
TE-->>User: CUDNN_STATUS_SUBLIBRARY_LOADING_FAILED
User->>UV: Ensure venv cuDNN packages used
User->>Build: Set CUDNN_PATH/CUDNN_HOME/LD_LIBRARY_PATH
end
alt JAX FFI not registered
TE->>JAX: Register custom calls during init
JAX-->>User: No registered implementation for custom call (CUDA)
User->>Pip: Reinstall/build with --no-build-isolation
end
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
.. code-block:: bash

export CUDNN_PATH=$(pwd)/.venv/lib/python3.12/site-packages/nvidia/cudnn
style: hardcoded Python version may not work for all users - consider using a generic placeholder like pythonX.Y or explaining users should adjust this
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
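One way to address the hardcoded `python3.12` concern is to derive the site-packages directory from the active interpreter instead of spelling out the version. The following is a minimal sketch, not the README's wording: it assumes the virtual environment is already activated and that the cuDNN wheel unpacks under `nvidia/cudnn`, as in the path quoted above. The `SITE_PACKAGES` variable name is illustrative.

```shell
# Assumption: the target virtual environment is already activated, so
# `python3` resolves to the venv interpreter.
SITE_PACKAGES="$(python3 -c 'import sysconfig; print(sysconfig.get_paths()["purelib"])')"

# Assumption: the cuDNN wheel unpacks under nvidia/cudnn inside
# site-packages, matching the path quoted from the README.
export CUDNN_PATH="${SITE_PACKAGES}/nvidia/cudnn"

# Warn (without failing) if the directory is missing, which usually means
# cuDNN is not installed in this environment.
if [ ! -d "${CUDNN_PATH}" ]; then
    echo "warning: ${CUDNN_PATH} does not exist; is cuDNN installed in this venv?" >&2
fi
```

Because the path is computed at run time, the same snippet works across Python minor versions and venv locations, at the cost of requiring the environment to be active first.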
style: extra blank line - RST should have only one blank line before code blocks (see lines 305-306 for consistent formatting)
Signed-off-by: jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com>
/te-ci
/te-ci
@@ -315,6 +315,37 @@ Troubleshooting
cd transformer_engine
pip install -v -v -v --no-build-isolation .
RST list nesting broken
The new troubleshooting section uses 1./2. numbered items with nested bullet * **Symptoms:** / * **Solution:** lines, but there’s no blank line separating the list item from the nested bullet list. In reStructuredText this often breaks nesting/formatting (the * bullets can get treated as literal text or start a new top-level list). Add a blank line after each numbered item title (e.g., after 1. **Import Error:**) before the indented * bullets, and likewise for the JAX section.
Also appears at README.rst:325, README.rst:338, and README.rst:346 (same pattern).
.. code-block:: bash

export CUDNN_PATH=$(pwd)/.venv/lib/python3.12/site-packages/nvidia/cudnn
Extra blank lines
There are two blank lines before the .. code-block:: bash directive. In RST, extra blank lines inside list items can cause the directive to detach from the list item and/or render with unexpected spacing. Reduce to a single blank line before the directive so it stays correctly nested under the Solution: bullet.
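The two formatting comments (a blank line between a numbered item and its nested bullets, and exactly one blank line before a nested directive) can be illustrated with a small scratch file. This is a hedged sketch: the file name `nesting_example.rst` and all placeholder text are hypothetical and are not taken from the README. Only the `**Import Error:**`, `**Symptoms:**`, and `**Solution:**` labels come from the review comment.

```shell
# Write a tiny reStructuredText sample showing the nesting the review asks for.
# All placeholder text below is illustrative, not README content.
cat > nesting_example.rst <<'EOF'
1. **Import Error:**

   * **Symptoms:** placeholder description of the error.
   * **Solution:** placeholder description of the fix.

     .. code-block:: bash

        echo "placeholder command"
EOF

# Show the result so the indentation can be inspected.
cat nesting_example.rst
```

The nested bullets are indented to align with the text after `1. `, and the directive is indented to align with the text after `* `, so both stay attached to their parent items. Rendering the scratch file with whatever RST tool the project already uses is a quick way to confirm the nesting before editing the README.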
Description
Expand the troubleshooting installation docs with a few recently debugged issues.

Type of change
Changes
uv venvs and JAX-specific issue symptoms.
Checklist: